Clinical genetic testing returns results that are categorized by testing labs into one of 5 categories: Benign Likely benign Variant of Uncertain Significance (VUS) Likely Pathogenic Pathogenic These classifications can change over time. We do not know how frequently variants change classification. We do not know if the rate at which variants are reclassified is different amongst different ancestry groups. We do not know if the rate at which variants are reclassified is different across different disease categories. The goal of this project will be to discover these differences across all of ClinVar.”
#read in data
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(purrr)
library(fs)
library(data.table)
##
## Attaching package: 'data.table'
##
## The following objects are masked from 'package:lubridate':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
##
## The following objects are masked from 'package:dplyr':
##
## between, first, last
##
## The following object is masked from 'package:purrr':
##
## transpose
library(arrow)
## Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
##
## Attaching package: 'arrow'
##
## The following object is masked from 'package:lubridate':
##
## duration
##
## The following object is masked from 'package:utils':
##
## timestamp
library(ggpie)
library(ggplot2)
library(reactable)
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
library("ggrepel")
library(corrplot)
## corrplot 0.92 loaded
library(ggeffects)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
merged_csv %>% distinct(variation_id) %>% nrow()
## [1] 2278097
merged_csv %>% distinct(submitted_classification, variation_id) %>% group_by(submitted_classification) %>% summarise(count=n())
merged_csv %>% distinct(method_type, variation_id) %>% group_by(method_type) %>% summarise(count=n()) %>% arrange(desc(count))
we added CLINVAR disease information and gnomad ancestry frequencies
The GNOMAD data were filtered by selecting only the allele frequency (AF) for the major ancestry groups, where the ancestry group with the max AF was selected for each variant. The GNOMAD data includes gene information, which we did not need to add. For the CLINVAR data, we calculated the first date of occurence for a submission and the last, i.e the first time a variant occured, and then the last occurence.
The structure of the data (CLINVAR merged with GNOMAD) should haveone variant per line per submission (date of occurence), but the trait of disease /phenotype associated can make it multi-level. To fix this, do: 8. collapse the multiple traits that are assigned to each variant per submission date, into a parent trait per line. We used the MESH term hierarchical structure as a selection guide.
We collapsed the traits into one line per variant and per submission. Select a second term from the root of the MESH terms; and then decide on what patterns of variant reclassifications are categorized as “upgrade” or “downgrade”, and integrate this information into the data. We wrote a function in R to take the processed CLINVAR data, and annotate it with the last or root MESH terms for traits.
We annotated the processed CLINVAR data with Dr. Drivas’ manual classification of a submitted variation such as : “Conflicting/Complex”, “Downgrade Path to Benign”, “Downgrade Path to VUS”, “Downgrade VUS to Benign”, “No Change”, “Upgrade Benign to Path”, “Upgrade Benign to VUS”, and “Upgrade VUS to Path”.
We added Dr. Drivas’s manually curated classification with the filtered and pre-processed CLINVAR data
clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing = readRDS("/project/drivas_shared/projects/CLINVAR/rds_obj/clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing.rds")
clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>% distinct(variation_id) %>% nrow()
## [1] 1054672
clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
filter(!is.na(class_merged)) %>%
# mutate(bin_change = if_else(Classification == "No Change", "No Change", "Change")) %>%
group_by(class_merged, ethnic_group_AFmax) %>%
summarise(count=n()) %>%
ungroup() %>%
group_by(ethnic_group_AFmax) %>%
mutate(percent_count = count/sum(count) * 100, total = sum(percent_count)) %>%
ggplot(aes(x = ethnic_group_AFmax, y = percent_count, fill = class_merged)) +
geom_bar(stat="identity") +
theme_classic() +
theme(axis.text.x = element_text(angle = 90), legend.title=element_blank()) +
ylab(label = "percent count") +
xlab(label = "classification") +
labs(color=NULL)
## `summarise()` has grouped output by 'class_merged'. You can override using the
## `.groups` argument.
clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
filter(!is.na(class_merged), class_merged != "No Change") %>%
# mutate(bin_change = if_else(Classification == "No Change", "No Change", "Change")) %>%
group_by(class_merged, ethnic_group_AFmax) %>%
summarise(count=n()) %>%
ungroup() %>%
group_by(ethnic_group_AFmax) %>%
mutate(percent_count = count/sum(count) * 100, total = sum(percent_count)) %>%
ggplot(aes(x = ethnic_group_AFmax, y = percent_count, fill = class_merged)) +
geom_bar(stat="identity") +
theme_classic() +
theme(axis.text.x = element_text(angle = 90), legend.title=element_blank()) +
ylab(label = "percent count") +
xlab(label = "classification") +
labs(color=NULL)
## `summarise()` has grouped output by 'class_merged'. You can override using the
## `.groups` argument.
clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
group_by(ethnic_group_AFmax,class_merged) %>%
filter(!is.na(class_merged)) %>%
mutate(bin_change = if_else(class_merged == "No Change", "0", "1")) %>%
summarise(count=n()) %>%
mutate(percent_count = count/sum(count) * 100, total = sum(percent_count)) %>%
ggplot(aes(x = ethnic_group_AFmax, y = percent_count, fill = class_merged)) +
geom_bar(stat="identity", color="black", position=position_dodge()) +
theme_classic() +
theme(axis.text.x = element_text(angle = 90)) +
ylab(label = "percent count") +
xlab(label = "ancestry")
## `summarise()` has grouped output by 'ethnic_group_AFmax'. You can override
## using the `.groups` argument.
clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
filter(!is.na(class_merged)) %>%
mutate(bin_change = if_else(class_merged == "No Change", "No Change", "Change")) %>%
group_by(ethnic_group_AFmax,bin_change) %>%
summarise(count=n()) %>%
ungroup() %>%
group_by(ethnic_group_AFmax) %>%
mutate(percent_count = count/sum(count) * 100, total = sum(percent_count)) %>%
ggplot(aes(x = ethnic_group_AFmax, y = percent_count, fill = bin_change)) +
geom_bar(stat="identity") +
theme_classic() +
theme(axis.text.x = element_text(angle = 90), legend.title=element_blank()) +
ylab(label = "percent count") +
xlab(label = "ancestry") +
labs(color=NULL)
## `summarise()` has grouped output by 'ethnic_group_AFmax'. You can override
## using the `.groups` argument.
recode reclassification
we recode the classification, where “No Change” = “0”,and and change = “1”
glm_dat = clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
distinct(variation_id, .keep_all = TRUE) %>%
filter(!is.na(class_merged)) %>%
mutate(bin_change = if_else(class_merged == "No Change", "0", "1")) %>%
spread(key = ethnic_group_AFmax, value = ethnic_group_AFmax_value) %>%
select(bin_change, AF_afr, AF_amr, AF_asj, AF_eas, AF_fin, AF_nfe, AF_oth, AF_sas) %>%
mutate(bin_change = as.numeric(bin_change))
we recode the reclassification, where “No change” is removed
glm_dat = clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
distinct(variation_id, .keep_all = TRUE) %>%
filter(!is.na(class_merged), class_merged != "No Change") %>%
# mutate(bin_change = if_else(Classification == "No Change", "0", "1")) %>%
spread(key = ethnic_group_AFmax, value = ethnic_group_AFmax_value) %>%
select(class_merged, AF_afr, AF_amr, AF_asj, AF_eas, AF_fin, AF_nfe, AF_oth, AF_sas)
# mutate(bin_change = as.numeric(bin_change))
clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
ggplot(aes(x=ethnic_group_AFmax_value, color=ethnic_group_AFmax)) +
geom_density() +
scale_y_continuous(trans="log10", name="density (log10)") +
labs(x = "ancestry group AF max") +
theme_classic() + guides(color = guide_legend(title = "ancestry groups"))
## Warning: Transformation introduced infinite values in continuous y-axis
clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
ggplot(aes(x=AF_value, color=AF_ethnic_group)) +
geom_density() +
scale_y_continuous(trans="log10", name="density (log10)") +
labs(x = "max AF") +
theme_classic() +
theme(legend.position = "none")
clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
distinct(last_mesh_term, variation_id) %>%
group_by(last_mesh_term) %>%
summarise(count=n()) %>%
arrange(desc(count))
clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
distinct(pattern, variation_id) %>%
group_by(pattern) %>%
summarise(count=n()) %>%
arrange(desc(count))
clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
distinct(medgen_name, variation_id) %>%
group_by(medgen_name) %>%
summarise(count=n()) %>%
arrange(desc(count))
clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
distinct(trait_type, variation_id) %>%
group_by(trait_type) %>%
summarise(count=n()) %>%
arrange(desc(count))
clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
distinct(medgen_name, class_merged, variation_id) %>%
group_by(medgen_name, class_merged) %>%
summarise(count=n()) %>% arrange(desc(count))
## `summarise()` has grouped output by 'medgen_name'. You can override using the
## `.groups` argument.
clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
distinct(ethnic_group_AFmax, last_mesh_term) %>%
group_by(ethnic_group_AFmax, last_mesh_term) %>%
summarise(count=n()) %>%
arrange(desc(count))
## `summarise()` has grouped output by 'ethnic_group_AFmax'. You can override
## using the `.groups` argument.
pie_dat = clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
group_by(submitted_classification) %>%
summarise(count=n()) %>%
ungroup() %>%
mutate(perc = count/sum(count)) %>%
arrange(perc) %>%
mutate(labels = scales::percent(perc))
# Create a basic bar
pie = ggplot(pie_dat, aes(x="", y=perc, fill=submitted_classification)) + geom_bar(stat="identity", width=1)
# Convert to pie (polar coordinates) and add labels
pie = pie + coord_polar("y", start=0) + geom_text(aes(label = paste0(round(perc*100), "%")), position = position_stack(vjust = 0.5)) + theme_void()
# Add color scale (hex colors)
pie = pie + scale_fill_manual(values=c("#55DDE0", "#33658A", "#2F4858", "#F6AE2D", "#F26419", "#999999"))
# Remove labels and add title
pie = pie + labs(x = NULL, y = NULL, fill = NULL, title = "submitted variant classification")
# Tidy up the theme
pie = pie + theme_classic() + theme(axis.line = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
plot.title = element_text(hjust = 0.5, color = "#666666"))
pie
##prepare count data
chisq_dat = clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
filter(!is.na(class_merged), class_merged != "No Change") %>%
mutate(Class = case_when(class_merged == "Upgrade VUS to Path" ~ "Up VUS-Path",
class_merged == "Upgrade Benign to VUS" ~ "Up Benign-VUS",
class_merged == "Upgrade Benign to Path" ~ "U Benign-Path",
class_merged == "Downgrade VUS to Benign" ~ "Down VUS-Benign",
class_merged == "Downgrade Path to VUS" ~ "Down Path-VUS",
class_merged == "Downgrade Path to Benign" ~ "Down Path-Benign",
# Classification == "Downgrade Path to VUS" ~ "Downgrade Path to VUS",
TRUE ~ "Conflicting/Complex")
) %>%
group_by(class_merged, ethnic_group_AFmax) %>%
summarise(count=n()) %>%
as.data.frame() %>%
spread(key = class_merged, value = count)
## `summarise()` has grouped output by 'class_merged'. You can override using the
## `.groups` argument.
rownames(chisq_dat) = chisq_dat$ethnic_group_AFmax
chisq_dat$ethnic_group_AFmax = NULL
chisq_dat_mat = as.matrix(chisq_dat)
library(gplots)
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
dt <- as.table(as.matrix(chisq_dat))
balloonplot(t(dt), main ="count of variants with change in classification by ancestry", xlab ="", ylab="",label = FALSE, show.margins = FALSE, zlab = FALSE)
chisq_dat_mat
## Conflicting/Complex Downgrade Path to Benign Downgrade Path to VUS
## AF_afr 867 219 772
## AF_amr 499 164 1141
## AF_asj 1599 251 272
## AF_eas 984 137 531
## AF_fin 844 173 206
## AF_nfe 2637 384 2262
## AF_oth 713 140 633
## AF_sas 983 205 882
## Downgrade VUS to Benign Upgrade Benign to Path Upgrade Benign to VUS
## AF_afr 39392 107 7075
## AF_amr 20770 48 3988
## AF_asj 16661 74 3371
## AF_eas 20990 289 3986
## AF_fin 7067 108 1331
## AF_nfe 41109 151 8143
## AF_oth 15275 46 3077
## AF_sas 23239 55 4184
## Upgrade VUS to Path
## AF_afr 2399
## AF_amr 2251
## AF_asj 798
## AF_eas 1649
## AF_fin 646
## AF_nfe 6399
## AF_oth 1502
## AF_sas 2107
The chi-square test of independence is used to analyze the frequency table (i.e. contengency table) formed by two categorical variables. The chi-square test evaluates whether there is a significant association between the categories of the two variables.
Chi-square test examines whether rows and columns of a contingency table are statistically significantly associated.
Null hypothesis (H0): the row and the column variables of the contingency table are independent.
Alternative hypothesis (H1): row and column variables are dependent
For each cell of the table, we have to calculate the expected value under null hypothesis.
For a given cell, the expected value is calculated as follow:
e=row.sum∗col.sum/grand.total
The Chi-square statistic is calculated as follow:
χ2=∑(o−e)2e
o is the observed value
e is the expected value
This calculated Chi-square statistic is compared to the critical value (obtained from statistical tables) with df=(r−1)(c−1)
degrees of freedom and p = 0.05.
r is the number of rows in the contingency table
c is the number of column in the contingency table
If the calculated Chi-square statistic is greater than the critical value, then we must conclude that the row and the column variables are not independent of each other. This implies that they are significantly associated.
Chi-square statistic can be easily computed using the function chisq.test() as follow:
chisq = chisq.test(chisq_dat_mat)
chisq
##
## Pearson's Chi-squared test
##
## data: chisq_dat_mat
## X-squared = 6561.4, df = 42, p-value < 2.2e-16
the row (ancestry groups) and the column (change in classification) variables are statistically significantly associated (p-value = 0).
The observed and the expected counts can be extracted from the result of the test as follow:
# Observed counts
chisq$observed
## Conflicting/Complex Downgrade Path to Benign Downgrade Path to VUS
## AF_afr 867 219 772
## AF_amr 499 164 1141
## AF_asj 1599 251 272
## AF_eas 984 137 531
## AF_fin 844 173 206
## AF_nfe 2637 384 2262
## AF_oth 713 140 633
## AF_sas 983 205 882
## Downgrade VUS to Benign Upgrade Benign to Path Upgrade Benign to VUS
## AF_afr 39392 107 7075
## AF_amr 20770 48 3988
## AF_asj 16661 74 3371
## AF_eas 20990 289 3986
## AF_fin 7067 108 1331
## AF_nfe 41109 151 8143
## AF_oth 15275 46 3077
## AF_sas 23239 55 4184
## Upgrade VUS to Path
## AF_afr 2399
## AF_amr 2251
## AF_asj 798
## AF_eas 1649
## AF_fin 646
## AF_nfe 6399
## AF_oth 1502
## AF_sas 2107
# Expected counts
round(chisq$expected,2)
## Conflicting/Complex Downgrade Path to Benign Downgrade Path to VUS
## AF_afr 1813.57 332.47 1331.26
## AF_amr 1029.71 188.77 755.87
## AF_asj 821.53 150.60 603.05
## AF_eas 1019.19 186.84 748.14
## AF_fin 370.16 67.86 271.72
## AF_nfe 2179.42 399.54 1599.81
## AF_oth 763.02 139.88 560.10
## AF_sas 1129.40 207.04 829.04
## Downgrade VUS to Benign Upgrade Benign to Path Upgrade Benign to VUS
## AF_afr 36665.45 174.48 6986.19
## AF_amr 20818.04 99.07 3966.65
## AF_asj 16609.13 79.04 3164.69
## AF_eas 20605.25 98.05 3926.10
## AF_fin 7483.70 35.61 1425.94
## AF_nfe 44061.87 209.68 8395.50
## AF_oth 15426.16 73.41 2939.28
## AF_sas 22833.40 108.66 4350.65
## Upgrade VUS to Path
## AF_afr 3527.58
## AF_amr 2002.90
## AF_asj 1597.96
## AF_eas 1982.43
## AF_fin 720.01
## AF_nfe 4239.18
## AF_oth 1484.15
## AF_sas 2196.80
Nature of the dependence between the ancestry groups and the change in classification variable
If you want to know the most contributing cells to the total Chi-square score, you just have to calculate the Chi-square statistic for each cell:
r = o − e /√e
The above formula returns the so-called Pearson residuals (r) for each cell (or standardized residuals).
Cells with the highest absolute standardized residuals contribute the most to the total Chi-square score.
For a given cell, the size of the circle is proportional to the amount of the cell contribution.
The sign of the standardized residuals is also very important to interpret the association between rows and columns as explained below.
1. Positive residuals are in blue. Positive values in cells specify an attraction (positive association) between the corresponding row and column variables.
2. In the image below, it’s evident that there are an association between the column "conflicting/complex" and the rows AF_asj, AF_fin; Down Path-VUS and AF_amr, AF_nfe, Down VUS-Benign AF_afr; Up Benign-Path and AF_eas.
There is a strong positive association between the column Up VUS-Path and the row AF-nfe.
3. Negative residuals are in red. This implies a repulsion (negative association) between the corresponding row and column variables. For example the column Up VUS-Path are negatively associated (~ “not associated”) with the row AF-afr and AF_asj. There is a repulsion between the column Down Path-VUS and, the row AF_afr.
corrplot(chisq$residuals, is.corr = FALSE)
The contribution (in %) of a given cell to the total Chi-square score is calculated as follow contrib = r2/χ2
r is the residual of the cell
contrib <- 100*chisq$residuals^2/chisq$statistic
round(contrib, 3)
## Conflicting/Complex Downgrade Path to Benign Downgrade Path to VUS
## AF_afr 7.530 0.590 3.581
## AF_amr 4.169 0.050 2.991
## AF_asj 11.214 1.020 2.770
## AF_eas 0.019 0.203 0.961
## AF_fin 9.244 2.483 0.242
## AF_nfe 1.464 0.009 4.177
## AF_oth 0.050 0.000 0.145
## AF_sas 0.289 0.000 0.052
## Downgrade VUS to Benign Upgrade Benign to Path Upgrade Benign to VUS
## AF_afr 3.090 0.398 0.017
## AF_amr 0.002 0.401 0.002
## AF_asj 0.002 0.005 0.205
## AF_eas 0.109 5.667 0.014
## AF_fin 0.354 2.242 0.096
## AF_nfe 3.016 0.250 0.116
## AF_oth 0.023 0.156 0.098
## AF_sas 0.110 0.404 0.097
## Upgrade VUS to Path
## AF_afr 5.503
## AF_amr 0.468
## AF_asj 6.103
## AF_eas 0.855
## AF_fin 0.116
## AF_nfe 16.771
## AF_oth 0.003
## AF_sas 0.056
The relative contribution of each cell to the total Chi-square score give some indication of the nature of the dependency between rows and columns of the contingency table.
It can be seen that:
The column “Up VUS-Path” is strongly associated with AF_nfe.
corrplot(contrib, is.cor = FALSE)
From the image above, it can be seen that the most contributing cells to
the Chi-square is AF_nfe/Up VUS-Path
These cells contribute about 21.280 % to the total Chi-square score and thus account for most of the difference between expected and observed values. This confirms the earlier visual interpretation of the data. As stated earlier, visual interpretation may be complex when the contingency table is very large. In this case, the contribution of one cell to the total Chi-square score becomes a useful way of establishing the nature of dependency.
glm_dat = glm_dat %>% mutate(across(where(is.numeric), ~replace_na(., mean(., na.rm=TRUE))))
we recode the classification, where “No change” is removed
clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
# distinct(variation_id, .keep_all = TRUE) %>%
# filter(!is.na(Classification), Classification != "No Change", Classification == "Upgrade VUS to Path") %>%
mutate(class = if_else(class_merged == "Upgrade VUS to Path", "Upgrade VUS to Path", "other class")) %>%
group_by(class, ethnic_group_AFmax) %>%
summarise(count=n()) %>% ungroup() %>%
group_by(ethnic_group_AFmax) %>%
mutate(percent_count = count/sum(count) * 100, total = sum(percent_count)) %>%
filter(!is.na(class), class != "other class") %>%
ggplot(aes(x = class, y = percent_count, fill = ethnic_group_AFmax)) +
geom_bar(stat="identity", color="black", position=position_dodge()) +
theme_classic() +
theme(axis.text.x = element_text(angle = 0), legend.title=element_blank()) +
ylab(label = "percent count") + xlab(label = "Classification")
## `summarise()` has grouped output by 'class'. You can override using the
## `.groups` argument.
##prepare count data
lr_dat = clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
filter(!is.na(class_merged), class_merged != "No Change") %>%
mutate(class = if_else(class_merged == "Upgrade VUS to Path", "Upgrade VUS to Path", "other class")) %>%
rename(ancestry_max_AF = ethnic_group_AFmax) %>%
# mutate() %>%
select(class, ancestry_max_AF)
#DF <- within(lr_dat, ancestry_max_AF <- relevel(ancestry_max_AF, ref = 6))
DF = lr_dat %>% mutate(ancestry_max_AF = relevel(factor(ancestry_max_AF), 6))
# save model
m2 <- glm(factor(class) ~ factor(ancestry_max_AF),
data = DF,
family = "binomial"
)
# print results
summary(m2)
##
## Call:
## glm(formula = factor(class) ~ factor(ancestry_max_AF), family = "binomial",
## data = DF)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.4704 -0.4030 -0.3712 -0.3110 2.5932
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.14547 0.01321 -162.39 <2e-16 ***
## factor(ancestry_max_AF)AF_afr -0.85964 0.02474 -34.75 <2e-16 ***
## factor(ancestry_max_AF)AF_amr -0.32445 0.02562 -12.66 <2e-16 ***
## factor(ancestry_max_AF)AF_asj -1.18153 0.03838 -30.79 <2e-16 ***
## factor(ancestry_max_AF)AF_eas -0.64712 0.02860 -22.62 <2e-16 ***
## factor(ancestry_max_AF)AF_fin -0.56660 0.04272 -13.26 <2e-16 ***
## factor(ancestry_max_AF)AF_oth -0.43765 0.02984 -14.66 <2e-16 ***
## factor(ancestry_max_AF)AF_sas -0.49528 0.02613 -18.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 128956 on 255784 degrees of freedom
## Residual deviance: 126860 on 255777 degrees of freedom
## AIC: 126876
##
## Number of Fisher Scoring iterations: 6
we recode the classification, where “No change” is removed
clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
# distinct(variation_id, .keep_all = TRUE) %>%
# filter(!is.na(Classification), Classification != "No Change", Classification == "Upgrade VUS to Path") %>%
mutate(class = if_else(class_merged == "Upgrade Benign to VUS", "Upgrade Benign to VUS", "other class")) %>%
group_by(class, ethnic_group_AFmax) %>%
summarise(count=n()) %>% ungroup() %>%
group_by(ethnic_group_AFmax) %>%
mutate(percent_count = count/sum(count) * 100, total = sum(percent_count)) %>%
filter(!is.na(class), class != "other class") %>%
ggplot(aes(x = class, y = percent_count, fill = ethnic_group_AFmax)) +
geom_bar(stat="identity", color="black", position=position_dodge()) +
theme_classic() +
theme(axis.text.x = element_text(angle = 0), legend.title=element_blank()) +
ylab(label = "percent count") + xlab(label = "Classification")
## `summarise()` has grouped output by 'class'. You can override using the
## `.groups` argument.
##prepare count data
lr_dat2 = clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
filter(!is.na(class_merged), class_merged != "No Change") %>%
mutate(class = if_else(class_merged == "Upgrade Benign to VUS", "Upgrade Benign to VUS", "other class")) %>%
rename(ancestry_max_AF = ethnic_group_AFmax) %>%
# mutate() %>%
select(class, ancestry_max_AF)
#DF <- within(lr_dat, ancestry_max_AF <- relevel(ancestry_max_AF, ref = 6))
DF = lr_dat2 %>% mutate(ancestry_max_AF = relevel(factor(ancestry_max_AF), 6))
# save model
m2 <- glm(factor(class) ~ factor(ancestry_max_AF),
data = DF,
family = "binomial"
)
# print results
summary(m2)
##
## Call:
## glm(formula = factor(class) ~ factor(ancestry_max_AF), family = "binomial",
## data = DF)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.5627 -0.5475 -0.5454 -0.5325 2.0266
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.872038 0.011903 -157.268 < 2e-16 ***
## factor(ancestry_max_AF)AF_afr 0.049977 0.017490 2.858 0.004270 **
## factor(ancestry_max_AF)AF_amr 0.041545 0.020800 1.997 0.045787 *
## factor(ancestry_max_AF)AF_asj 0.108916 0.022118 4.924 8.47e-07 ***
## factor(ancestry_max_AF)AF_eas 0.052893 0.020815 2.541 0.011049 *
## factor(ancestry_max_AF)AF_fin -0.044133 0.031679 -1.393 0.163586
## factor(ancestry_max_AF)AF_oth 0.088601 0.022832 3.881 0.000104 ***
## factor(ancestry_max_AF)AF_sas -0.009825 0.020423 -0.481 0.630468
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 204776 on 255784 degrees of freedom
## Residual deviance: 204726 on 255777 degrees of freedom
## AIC: 204742
##
## Number of Fisher Scoring iterations: 4
we recode the classification, where “No change” is removed
clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
# distinct(variation_id, .keep_all = TRUE) %>%
# filter(!is.na(Classification), Classification != "No Change", Classification == "Upgrade VUS to Path") %>%
mutate(class = if_else(class_merged == "Upgrade Benign to Path", "Upgrade Benign to Path", "other class")) %>%
group_by(class, ethnic_group_AFmax) %>%
summarise(count=n()) %>% ungroup() %>%
group_by(ethnic_group_AFmax) %>%
mutate(percent_count = count/sum(count) * 100, total = sum(percent_count)) %>%
filter(!is.na(class), class != "other class") %>%
ggplot(aes(x = class, y = percent_count, fill = ethnic_group_AFmax)) +
geom_bar(stat="identity", color="black", position=position_dodge()) +
theme_classic() +
theme(axis.text.x = element_text(angle = 0), legend.title=element_blank()) +
ylab(label = "percent count") + xlab(label = "Classification")
## `summarise()` has grouped output by 'class'. You can override using the
## `.groups` argument.
##prepare count data
lr_dat2 = clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
filter(!is.na(class_merged), class_merged != "No Change") %>%
mutate(class = if_else(class_merged == "Upgrade Benign to Path", "Upgrade Benign to Path", "other class")) %>%
rename(ancestry_max_AF = ethnic_group_AFmax) %>%
# mutate() %>%
select(class, ancestry_max_AF)
#DF <- within(lr_dat, ancestry_max_AF <- relevel(ancestry_max_AF, ref = 6))
DF = lr_dat2 %>% mutate(ancestry_max_AF = relevel(factor(ancestry_max_AF), 6))
# save model
m2 <- glm(factor(class) ~ factor(ancestry_max_AF),
data = DF,
family = "binomial"
)
# print results
summary(m2)
##
## Call:
## glm(formula = factor(class) ~ factor(ancestry_max_AF), family = "binomial",
## data = DF)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.1447 -0.0704 -0.0656 -0.0649 3.5774
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.00027 0.08148 -73.641 <2e-16 ***
## factor(ancestry_max_AF)AF_afr -0.16106 0.12651 -1.273 0.2030
## factor(ancestry_max_AF)AF_amr -0.39711 0.16585 -2.394 0.0166 *
## factor(ancestry_max_AF)AF_asj 0.26317 0.14211 1.852 0.0640 .
## factor(ancestry_max_AF)AF_eas 1.41689 0.10067 14.075 <2e-16 ***
## factor(ancestry_max_AF)AF_fin 1.44571 0.12647 11.431 <2e-16 ***
## factor(ancestry_max_AF)AF_oth -0.13943 0.16860 -0.827 0.4082
## factor(ancestry_max_AF)AF_sas -0.35331 0.15765 -2.241 0.0250 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 11717 on 255784 degrees of freedom
## Residual deviance: 11252 on 255777 degrees of freedom
## AIC: 11268
##
## Number of Fisher Scoring iterations: 9
we recode the classification, where “No change” is removed
clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
# distinct(variation_id, .keep_all = TRUE) %>%
# filter(!is.na(Classification), Classification != "No Change", Classification == "Upgrade VUS to Path") %>%
mutate(class = if_else(class_merged == "Downgrade VUS to Benign", "Downgrade VUS to Benign", "other class")) %>%
group_by(class, ethnic_group_AFmax) %>%
summarise(count=n()) %>% ungroup() %>%
group_by(ethnic_group_AFmax) %>%
mutate(percent_count = count/sum(count) * 100, total = sum(percent_count)) %>%
filter(!is.na(class), class != "other class") %>%
ggplot(aes(x = class, y = percent_count, fill = ethnic_group_AFmax)) +
geom_bar(stat="identity", color="black", position=position_dodge()) +
theme_classic() +
theme(axis.text.x = element_text(angle = 0), legend.title=element_blank()) +
ylab(label = "percent count") + xlab(label = "Classification")
## `summarise()` has grouped output by 'class'. You can override using the
## `.groups` argument.
##prepare count data
lr_dat2 = clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
filter(!is.na(class_merged), class_merged != "No Change") %>%
mutate(class = if_else(class_merged == "Downgrade VUS to Benign", "Downgrade VUS to Benign", "other class")) %>%
rename(ancestry_max_AF = ethnic_group_AFmax) %>%
# mutate() %>%
select(class, ancestry_max_AF)
#DF <- within(lr_dat, ancestry_max_AF <- relevel(ancestry_max_AF, ref = 6))
DF = lr_dat2 %>% mutate(ancestry_max_AF = relevel(factor(ancestry_max_AF), 6))
# save model
m2 <- glm(factor(class) ~ factor(ancestry_max_AF),
data = DF,
family = "binomial"
)
# print results
summary(m2)
##
## Call:
## glm(formula = factor(class) ~ factor(ancestry_max_AF), family = "binomial",
## data = DF)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8900 -0.8112 -0.7851 1.4951 1.7271
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.721696 0.008625 -83.678 <2e-16 ***
## factor(ancestry_max_AF)AF_afr -0.514839 0.013682 -37.629 <2e-16 ***
## factor(ancestry_max_AF)AF_amr -0.221062 0.015688 -14.091 <2e-16 ***
## factor(ancestry_max_AF)AF_asj -0.240561 0.017074 -14.089 <2e-16 ***
## factor(ancestry_max_AF)AF_eas -0.297365 0.015938 -18.658 <2e-16 ***
## factor(ancestry_max_AF)AF_fin -0.037397 0.022764 -1.643 0.1
## factor(ancestry_max_AF)AF_oth -0.194432 0.017421 -11.161 <2e-16 ***
## factor(ancestry_max_AF)AF_sas -0.294002 0.015370 -19.128 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 302697 on 255784 degrees of freedom
## Residual deviance: 301110 on 255777 degrees of freedom
## AIC: 301126
##
## Number of Fisher Scoring iterations: 4
we recode the classification, where “No change” is removed
clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
# distinct(variation_id, .keep_all = TRUE) %>%
# filter(!is.na(Classification), Classification != "No Change", Classification == "Upgrade VUS to Path") %>%
mutate(class = if_else(class_merged == "Downgrade Path to VUS", "Downgrade Path to VUS", "other class")) %>%
group_by(class, ethnic_group_AFmax) %>%
summarise(count=n()) %>% ungroup() %>%
group_by(ethnic_group_AFmax) %>%
mutate(percent_count = count/sum(count) * 100, total = sum(percent_count)) %>%
filter(!is.na(class), class != "other class") %>%
ggplot(aes(x = class, y = percent_count, fill = ethnic_group_AFmax)) +
geom_bar(stat="identity", color="black", position=position_dodge()) +
theme_classic() +
theme(axis.text.x = element_text(angle = 0), legend.title=element_blank()) +
ylab(label = "percent count") + xlab(label = "Classification")
## `summarise()` has grouped output by 'class'. You can override using the
## `.groups` argument.
##prepare count data
lr_dat2 = clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
filter(!is.na(class_merged), class_merged != "No Change") %>%
mutate(class = if_else(class_merged == "Downgrade Path to VUS", "Downgrade Path to VUS", "other class")) %>%
rename(ancestry_max_AF = ethnic_group_AFmax) %>%
# mutate() %>%
select(class, ancestry_max_AF)
#DF <- within(lr_dat, ancestry_max_AF <- relevel(ancestry_max_AF, ref = 6))
DF = lr_dat2 %>% mutate(ancestry_max_AF = relevel(factor(ancestry_max_AF), 6))
# save model
m2 <- glm(factor(class) ~ factor(ancestry_max_AF),
data = DF,
family = "binomial"
)
# print results
summary(m2)
##
## Call:
## glm(formula = factor(class) ~ factor(ancestry_max_AF), family = "binomial",
## data = DF)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9795 0.1749 0.2377 0.2747 0.2840
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.25828 0.02143 152.069 < 2e-16 ***
## factor(ancestry_max_AF)AF_afr 0.91369 0.04212 21.691 < 2e-16 ***
## factor(ancestry_max_AF)AF_amr -0.06803 0.03703 -1.837 0.0662 .
## factor(ancestry_max_AF)AF_asj 1.16841 0.06465 18.073 < 2e-16 ***
## factor(ancestry_max_AF)AF_eas 0.70816 0.04876 14.522 < 2e-16 ***
## factor(ancestry_max_AF)AF_fin 0.64094 0.07356 8.713 < 2e-16 ***
## factor(ancestry_max_AF)AF_oth 0.23169 0.04568 5.072 3.94e-07 ***
## factor(ancestry_max_AF)AF_sas 0.29392 0.04032 7.290 3.09e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 62022 on 255784 degrees of freedom
## Residual deviance: 60977 on 255777 degrees of freedom
## AIC: 60993
##
## Number of Fisher Scoring iterations: 7
we recode the classification, where “No change” is removed
clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
# distinct(variation_id, .keep_all = TRUE) %>%
# filter(!is.na(Classification), Classification != "No Change", Classification == "Upgrade VUS to Path") %>%
mutate(class = if_else(class_merged == "Downgrade Path to Benign", "Downgrade Path to Benign", "other class")) %>%
group_by(class, ethnic_group_AFmax) %>%
summarise(count=n()) %>% ungroup() %>%
group_by(ethnic_group_AFmax) %>%
mutate(percent_count = count/sum(count) * 100, total = sum(percent_count)) %>%
filter(!is.na(class), class != "other class") %>%
ggplot(aes(x = class, y = percent_count, fill = ethnic_group_AFmax)) +
geom_bar(stat="identity", color="black", position=position_dodge()) +
theme_classic() +
theme(axis.text.x = element_text(angle = 0), legend.title=element_blank()) +
ylab(label = "percent count") + xlab(label = "Classification")
## `summarise()` has grouped output by 'class'. You can override using the
## `.groups` argument.
##prepare count data
lr_dat2 = clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
filter(!is.na(class_merged), class_merged != "No Change") %>%
mutate(class = if_else(class_merged == "Downgrade Path to Benign", "Downgrade Path to Benign", "other class")) %>%
rename(ancestry_max_AF = ethnic_group_AFmax) %>%
# mutate() %>%
select(class, ancestry_max_AF)
#DF <- within(lr_dat, ancestry_max_AF <- relevel(ancestry_max_AF, ref = 6))
DF = lr_dat2 %>% mutate(ancestry_max_AF = relevel(factor(ancestry_max_AF), 6))
# save model
m2 <- glm(factor(class) ~ factor(ancestry_max_AF),
data = DF,
family = "binomial"
)
# print results
summary(m2)
##
## Call:
## glm(formula = factor(class) ~ factor(ancestry_max_AF), family = "binomial",
## data = DF)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3007 0.0981 0.1123 0.1140 0.1834
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.06307 0.05119 98.903 < 2e-16 ***
## factor(ancestry_max_AF)AF_afr 0.37980 0.08489 4.474 7.68e-06 ***
## factor(ancestry_max_AF)AF_amr 0.10161 0.09356 1.086 0.27746
## factor(ancestry_max_AF)AF_asj -0.55511 0.08154 -6.808 9.91e-12 ***
## factor(ancestry_max_AF)AF_eas 0.27211 0.09978 2.727 0.00639 **
## factor(ancestry_max_AF)AF_fin -0.98603 0.09219 -10.696 < 2e-16 ***
## factor(ancestry_max_AF)AF_oth -0.04079 0.09905 -0.412 0.68046
## factor(ancestry_max_AF)AF_sas -0.02993 0.08678 -0.345 0.73018
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 20164 on 255784 degrees of freedom
## Residual deviance: 19931 on 255777 degrees of freedom
## AIC: 19947
##
## Number of Fisher Scoring iterations: 8
we recode the classification, where “No change” is removed
clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
# distinct(variation_id, .keep_all = TRUE) %>%
# filter(!is.na(Classification), Classification != "No Change", Classification == "Upgrade VUS to Path") %>%
mutate(class = if_else(class_merged == "Conflicting/Complex", "Conflicting/Complex", "other class")) %>%
group_by(class, ethnic_group_AFmax) %>%
summarise(count=n()) %>% ungroup() %>%
group_by(ethnic_group_AFmax) %>%
mutate(percent_count = count/sum(count) * 100, total = sum(percent_count)) %>%
filter(!is.na(class), class != "other class") %>%
ggplot(aes(x = class, y = percent_count, fill = ethnic_group_AFmax)) +
geom_bar(stat="identity", color="black", position=position_dodge()) +
theme_classic() +
theme(axis.text.x = element_text(angle = 0), legend.title=element_blank()) +
ylab(label = "percent count") + xlab(label = "Classification")
## `summarise()` has grouped output by 'class'. You can override using the
## `.groups` argument.
##prepare count data
lr_dat2 = clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
filter(!is.na(class_merged), class_merged != "No Change") %>%
mutate(class = if_else(class_merged == "Conflicting/Complex", "Conflicting/Complex", "other class")) %>%
rename(ancestry_max_AF = ethnic_group_AFmax) %>%
# mutate() %>%
select(class, ancestry_max_AF)
#DF <- within(lr_dat, ancestry_max_AF <- relevel(ancestry_max_AF, ref = 6))
DF = lr_dat2 %>% mutate(ancestry_max_AF = relevel(factor(ancestry_max_AF), 6))
# save model
m2 <- glm(factor(class) ~ factor(ancestry_max_AF),
data = DF,
family = "binomial"
)
# print results
summary(m2)
##
## Call:
## glm(formula = factor(class) ~ factor(ancestry_max_AF), family = "binomial",
## data = DF)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8535 0.1868 0.2604 0.2971 0.4119
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.09850 0.01991 155.641 < 2e-16 ***
## factor(ancestry_max_AF)AF_afr 0.95552 0.03961 24.120 < 2e-16 ***
## factor(ancestry_max_AF)AF_amr 0.94170 0.04934 19.084 < 2e-16 ***
## factor(ancestry_max_AF)AF_asj -0.50322 0.03269 -15.396 < 2e-16 ***
## factor(ancestry_max_AF)AF_eas 0.23480 0.03806 6.169 6.89e-10 ***
## factor(ancestry_max_AF)AF_fin -0.67434 0.04106 -16.423 < 2e-16 ***
## factor(ancestry_max_AF)AF_oth 0.26861 0.04298 6.250 4.11e-10 ***
## factor(ancestry_max_AF)AF_sas 0.34200 0.03803 8.993 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 78760 on 255784 degrees of freedom
## Residual deviance: 76593 on 255777 degrees of freedom
## AIC: 76609
##
## Number of Fisher Scoring iterations: 6
library(ggsankey)
plot_dat = clinvar_gnomad_by_ancestry_max_0714_gene_diseases_missing %>%
select(submitted_classification, class_merged)
plot_dat_lg = plot_dat %>% filter(class_merged != "No Change") %>% make_long(submitted_classification, class_merged)
pl <- ggplot(plot_dat_lg, aes(x = x,
next_x = next_x,
node = node,
next_node = next_node,
fill = factor(node),
label = node)) # This Creates a label for each node
pl <- pl +geom_sankey(flow.alpha = 0.5, #This Creates the transparency of your node
node.color = "black", # This is your node color
show.legend = TRUE) # This determines if you want your legend to show
pl <- pl + geom_sankey_label(Size = 3,
color = "black",
fill = "white") # This specifies the Label format for each node
## Warning in geom_sankey_label(Size = 3, color = "black", fill = "white"):
## Ignoring unknown parameters: `Size`
pl <- pl + theme_bw()
pl <- pl + theme(legend.position = 'none')
pl <- pl + theme(axis.title = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
panel.grid = element_blank())
pl <- pl + scale_fill_viridis_d(option = "inferno")
pl <- pl + labs(title = "reclassification of variants")
pl <- pl + labs(subtitle = "CLINVAR: 05-20-2023 release")
pl <- pl + labs(caption ="Trust Odia and Theodore Drivas" )
pl <- pl + labs(fill = 'Nodes')
pl